PATTERN DISCOVERY TECHNIQUES FOR DETERMINING MAXIMAL 
IRREDUNDANT AND REDUNDANT MOTIFS 



Cross Reference to Related Applications 

This application claims the benefit of United States Provisional 
Application Number 60/292,241, filed May 18, 2001, the disclosure of which is 
incorporated by reference herein. 

Field of the Invention 

The present invention relates to pattern discovery and, more particularly, 
relates to pattern discovery techniques for determining maximal irredundant and 
redundant motifs. 

Background of the Invention 

Pattern or motif discovery in data is widely used as a means of 
understanding large volumes of data such as DeoxyriboNucleic Acid (DNA) or protein 
sequences. There are a variety of currently existing pattern discovery techniques. Many of 
these techniques discover "rigid" motifs. Some of these techniques have been extended to 
"flexible" motifs. 

A "rigid" motif is a repeating pattern that has the same length in every 
occurrence in an input sequence of data. The pattern contained in a rigid motif can 
contain "don't care" characters, which are generally symbolized by dots. A don't care 
character means that any character can occupy this particular location. For example, given 
a string s = abcdaXcdabbcd, the rigid motif m = a.cd occurs twice in the data, at
positions 1 and 5 in s. A "flexible" motif is a repeating pattern that has a variable number 
of don't care characters. For instance, in the previous example, a flexible motif occurs 
three times, at positions 1, 5 and 9. At position 9, there would be two dot characters to 
represent two gaps instead of one. This flexible motif may be written as m = a.^[1,2]cd,
where the [1,2] indicates that one or two don't care characters are allowed.

Allowing motifs to have a variable number of don't care characters 
increases the number of discovered motifs but also increases discovery time and 
algorithm complexity. 
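
By way of illustration only, the rigid and flexible occurrences described above for s = abcdaXcdabbcd may be checked with the following Python sketch. The helper names and the encoding of a motif as a list of solid characters and (lo, hi) gap intervals are assumptions made for this example and are not part of the described techniques.

```python
import re

def flexible_motif_to_regex(elements):
    """Translate a motif into a regular expression.

    `elements` is a list whose items are either a solid character (str)
    or a (lo, hi) tuple meaning "between lo and hi don't care characters".
    """
    parts = []
    for e in elements:
        if isinstance(e, tuple):
            lo, hi = e
            parts.append(".{%d,%d}" % (lo, hi))
        else:
            parts.append(re.escape(e))
    return "".join(parts)

def occurrences(elements, s):
    """Return the 1-based positions in s at which the motif occurs."""
    pattern = re.compile(flexible_motif_to_regex(elements), re.DOTALL)
    return [i + 1 for i in range(len(s)) if pattern.match(s, i)]

s = "abcdaXcdabbcd"
rigid = ["a", (1, 1), "c", "d"]      # a.cd
flexible = ["a", (1, 2), "c", "d"]   # a, one or two dots, cd
print(occurrences(rigid, s))     # [1, 5]
print(occurrences(flexible, s))  # [1, 5, 9]
```

The rigid motif is treated here as the special case of a gap interval whose two ends coincide.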

Typically, the higher the number of repeating patterns in a sequence, the 
higher the number of motifs in the data. Motif discovery on such data, such as repeating 
DNA or protein sequences, is a source of concern because these data exhibit a very high 
degree of self-similarity (i.e., repeating patterns). The number of rigid motifs could 
potentially be exponential in the size of the input sequence and, in the case where the 
input is a sequence of real numbers, there could be an infinite number of motifs 
(assuming two real numbers are equal if they are within some δ of each other).

Usually, this problem of a large number of motifs is tackled by 
pre-processing the input, using heuristics, to remove the repeating or self-similar portions 
of the input, or by using a "statistical significance" measure. These types of models, 
therefore, reduce the number of motifs to a more manageable level. However, due to the 
absence of a good understanding of the domain, there is no consensus over the right 
model to use. In other words, if the domain is DNA, there may be insufficient information
to know whether a statistical significance measure is correct. Consequently, important 
motifs may be discarded because of the particular statistical significance measure being 
used. Thus, there is a trend in different fields towards motif discovery that does not use 
models. 

There has been empirical evidence showing that the run-time for 
"model-less" motif discovery is linear in the output size for rigid motifs. However, none 
of the currently known algorithms has a proven output-sensitive complexity bound, and 
the only known complexity bounds are all exponential in the input size n. In other words, 
current pattern discovery algorithms depend on the size of the input, regardless of the size 
of the output of discovered patterns. This is important because one input may be the same
size as another, yet produce a much smaller output of discovered patterns. With current
pattern discovery, pattern discovery for both of these inputs will take approximately the 
same amount of time. 

In order to apply motif discovery techniques to real life situations, one has 
to deal with the fact that, in many applications, the input is known with a margin of error. 
Many amino acids in protein sequences, for instance, are easily interchanged by evolution 
without loss of function. Also, the use of distance matrices in the context of DNA 
sequences is common. For example, a character a can be viewed as a or b for pattern 
detection purposes, but ab cannot be viewed as an a. In all these situations, it is possible 
to view the input as a string of sets of characters instead of just characters. For instance, a 
sequence of the form baccta can be viewed as b{a,b}cct{a,b}. In some other 
applications, the input is an array of real numbers, and two distinct real numbers are 
deemed identical for pattern detection purposes if they are within some given δ > 0 of
each other. Conventional motif discovery algorithms deal with these situations in an ad 
hoc manner, with no uniform framework, such that the same algorithm cannot tackle all 
the scenarios described above. 

Thus, what is needed are techniques that overcome the following 
problems: (1) the problem of flexible and rigid pattern discovery within reasonable 
complexity and time; (2) the problem of solely input-sensitive complexity; and (3) the 
problem of the non-uniform framework for real numbers and sets of strings of characters 
during pattern discovery. 

Summary of the Invention 

The present invention provides techniques for determining maximal 
motifs. These techniques have an output-sensitive portion and have proven complexity. 
Additionally, the techniques support pattern discovery for rigid patterns, flexible patterns, 
real-number patterns, and patterns having sets of characters for each element. Broadly,
from an input sequence, a set of basis motifs is determined. Then, using the basis motifs,
a set of redundant motifs is determined. The redundant and basis motifs comprise a set
of maximal motifs associated with and defined by the input sequence. 

In one aspect of the invention, basis motifs are determined through a 
technique that begins by creating small solid motifs and continues from there to create 
larger motifs that include "don't care" characters and that can include flexible portions. A 
solid motif is a rigid motif without "don't care" characters. A basis motif is an 
irredundant, maximal motif. The small solid motifs are concatenated to create larger 
motifs, and these larger motifs can include don't care characters and flexible portions. This
technique can be iterative. During each iteration, motifs are trimmed to remove redundant 
motifs and other motifs that do not meet certain criteria. If iterated, the process may be 
continued until no new motifs are determined. At this point, the basis set of motifs has 
been determined. 

In a second aspect of the invention, the basis motifs are used to construct 
the redundant motifs. The redundant motifs are formed by determining a number of 
subsets made of selected basis motifs. From these subsets of basis motifs, unique 
intersection sets are determined. The redundant motifs are determined from the unique 
intersection sets and motif sets created from the subsets of basis motifs. This process may 
also be iterative and can continue, by selecting additional basis motifs, until all basis 
motifs have been selected. 

A more complete understanding of the present invention, as well as further 
features and advantages of the present invention, will be obtained by reference to the 
following detailed description and drawings. 

Brief Description of the Drawings 

FIG. 1 is a block diagram of a system for determining a set of maximal
motifs, in accordance with one embodiment of the present invention;

FIG. 2 illustrates a set of maximal motifs;

FIG. 3 is a flowchart of a method for creating a basis set of motifs, in 
accordance with one embodiment of the present invention; 

FIG. 4 illustrates an example sequence input, some exemplary partial 
results of steps in the method of FIG. 3, and resultant basis motifs for the example 
sequence input; 

FIG. 5 is a flowchart of a method for creating a redundant set of motifs 
from a basis set of motifs, in accordance with one embodiment of the present invention; 

FIG. 6 illustrates an exemplary vector space of basis motifs and resulting 
sets from the vector space, in accordance with one embodiment of the present invention; 

FIG. 7 illustrates an exemplary tree used to determine unique intersections 
sets from the sets shown in FIG. 6, in accordance with one embodiment of the present 
invention; 

FIGS. 8 and 9 show additional exemplary vector spaces created from basis 
motifs, in accordance with one embodiment of the present invention; and 

FIG. 10 shows an exemplary system, in accordance with one embodiment 
of the present invention, suitable for performing the present invention. 

Detailed Description of Preferred Embodiments 

The present invention provides techniques to determine maximal motifs 
from an input sequence. Pattern discovery techniques in accordance with the present 
invention proceed in basically two phases. In the first phase, a set of maximal irredundant
motifs is determined. These irredundant motifs are called "basis motifs" herein. Using
the basis motifs, the present invention, in the second phase, determines a set of maximal 
redundant motifs. The result of both phases is a set of maximal irredundant and redundant 
motifs. It should be noted that, if desired, only the basis motifs may be determined. 

Some major benefits of the present invention are as follows: (1) the
complexity and pattern discovery time are low; (2) the complexity and discovery time are
bounded; (3) the complexity and discovery time are related to the output for the
redundant motif determination, which is referred to as being "output sensitive"; (4) the basis
motifs provide a basis from which additional motifs may be determined; and (5) the
present invention may be used for a wide variety of input sequences, including sequences
containing real numbers or containing sets of characters. Importantly, the complexity of
both phases of the present invention is proportional to the sizes of both the input and the
output. More precisely, the complexity of both phases of the present invention is bounded
by O((n^5 + N) log n), where O(·) refers to a complexity on the order of the portion
enclosed by parentheses, N is the size of the output, and n is the size of the input.

It should be noted that the term motif is primarily used throughout the 
present description. However, the term pattern is also used and motif and pattern should 
be considered equivalent and interchangeable. 

Referring now to FIG. 1, a system 100 is shown for determining a set of 
maximal motifs 130 from an input sequence 105. System 100 comprises a basis motif 
determination operation 110 and a redundant motif determination operation 120. Input 
sequence 105 is a sequence of elements from an alphabet. Some exemplary input 
sequences are shown in FIG. 1. Sequence 106 is a series of letters from an alphabet of 
characters. Generally, the alphabet is a reduced set of characters from the English 
alphabet. However, any alphabet may be used. Sequence 107 is a series of sets of letters,
wherein each letter comes from an alphabet. Sequence 108 is a series of real numbers, for
which the alphabet comprises integers. Real numbers may be considered sets of integers.
This is described in more detail below. 

The basis motif determination operation 110 is described in additional 
detail in reference to FIG. 3. Briefly, the basis motif determination operation 110 accepts 
a sequence 105 and determines, from the sequence 105, a set of irredundant maximal 
motifs. These irredundant, maximal motifs are called "basis motifs" herein and are
represented by basis motifs 115. Basis motifs 115 are unique motifs that can be used to
form other motifs in a space defined by the input sequence 105. Simplistically,
"irredundant" means that no basis motif 115 can be formed by a combination of any other
basis motifs 115, and "maximal" means that a maximal motif is the largest motif 
comprising particular elements. These definitions are explained in greater detail below. 

Redundant motif determination operation 120 is described in more detail 
in reference to FIG. 5. Briefly, redundant motif determination operation 120 uses the 
basis motifs 115 to determine a set of redundant motifs 125. Redundant motifs 125 are 
maximal motifs but may be determined from a combination of basis motifs 115. 
Redundant motifs 125 and basis motifs 115 are combined in addition module 127 to 
create a set of maximal motifs 130. 

Importantly, both the basis motif determination operation 110 and the 
redundant motif determination operation 120 have a relatively low complexity compared 
to current pattern discovery methods. Moreover, system 100 has a running time that is 
linear in the size of the output (i.e., the set of maximal motifs 130). Thus, system 100 has 
an output-sensitive complexity bound, and this bound may be proven. 

Referring now to FIG. 2, a simplistic view of a set of maximal motifs 130 
is shown. Basis motifs 115 are basically "core" motifs that can be used to determine the 
redundant motifs 125. The set of maximal motifs 130 is basically a space defined by a 
particular input sequence. The basis motifs 115 are a set of motifs through which other 
redundant motifs 125 in the set of maximal motifs 130 may be determined. FIG. 2 is
discussed herein as an aid to understanding the present invention and should not be 
construed to be limiting. 

Before proceeding with more detailed discussions of basis and redundant
motifs and their determination, it is useful to provide some definitions.

Preliminary Definitions

Let s be a sequence of sets of characters from an alphabet Σ, '.' ∉ Σ. The '.'
is called a "don't care" or a dot character and any other element is called solid. Also, σ
will refer to a singleton character or a set of characters from Σ. For brevity of notation, a
singleton set is not enclosed in curly braces. For example, let Σ = {A,C,G,T}; then
s_1 = ACTGAT and s_2 = {A,T}CG{T,G} are two possible sequences. The j-th (1 ≤ j ≤ |s|)
element of the sequence is given by s[j]. For instance, in the previous example,
s_2[1] = {A,T}, s_2[2] = {C}, s_2[3] = {G}, and s_2[4] = {T,G}. Also, if x is a sequence,
then |x| denotes the length of the sequence, and, if x is a set of elements, then |x| denotes
the cardinality of the set. Hence |s_1| = 6, |s_2| = 4, |s_1[1]| = 1, and |s_2[4]| = 2.

Definition 1: (e_1 ⪯ e_2). The condition (e_1 ⪯ e_2) holds if and only if e_1 is a
"don't care" character or e_1 ⊆ e_2.

The flexibility of a motif is due to the variability in the number of dot 
characters and flexibility is added by annotating the dot characters. 

Definition 2: annotated dot character, .^α. An annotated '.' character is
written as .^α, where α is a set of non-negative integers {α_1, α_2, ..., α_k} or an interval
α = [α_l, α_u] representing all integers between α_l and α_u, including α_l and α_u.

To avoid clutter, the annotation superscript α will be an integer interval.

Definition 3: rigid and flexible strings. Given a string m, if at least one dot
element is annotated, m is called a flexible string; otherwise, m is called rigid.

Definition 4: realization. Let p be a flexible string. A rigid string p' is a
realization of p if each annotated dot element .^α is replaced by l dot elements, where l ∈ α.

For example, if p = a.^[3,6]b.^[2,5]cde, then p' = a...b..cde is a realization of p
and so is p'' = a...b...cde.

Definition 5: p occurs at l. A rigid string p occurs at position l on s if
p[j] ⪯ s[l + j - 1] holds for 1 ≤ j ≤ |p|. A flexible string p occurs at position l in s if there
exists a realization p' of p that occurs at l.

If p is flexible, then p could possibly occur multiple times at a location on
a string s. For example, if s = axbcbc, then p = a.^[1,3]bc occurs twice at position 1 as axbc
(i.e., a.bc) and axbcbc (i.e., a...bc). This multiplicity of occurrence increases the
complexity of an algorithm that discovers flexible motifs over that of an algorithm that
discovers rigid motifs.
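
As a non-limiting illustration of realizations and of multiple occurrences at a single position, the following Python sketch enumerates the realizations of a flexible string and tests each one against the input s = axbcbc. The function names and the list-of-sets encoding of the input are assumptions of this sketch, and the gap interval [1,3] is chosen so that both occurrences named in the example are admitted.

```python
from itertools import product

DOT = "."

def realizations(elements):
    """Yield every rigid realization of a flexible string.

    Each element is a solid character or a (lo, hi) interval standing for
    an annotated dot character.
    """
    choices = [
        [DOT * n for n in range(e[0], e[1] + 1)] if isinstance(e, tuple) else [e]
        for e in elements
    ]
    for combo in product(*choices):
        yield "".join(combo)

def occurs_at(rigid, s, pos):
    """True if the rigid string occurs at 1-based position pos of s.

    s is a list of sets of characters; a dot matches anything and a solid
    character matches when it belongs to the set at that position.
    """
    start = pos - 1
    if start + len(rigid) > len(s):
        return False
    return all(c == DOT or c in s[start + j] for j, c in enumerate(rigid))

s = [{c} for c in "axbcbc"]
p = ["a", (1, 3), "b", "c"]          # a, one to three dots, bc
hits = [r for r in realizations(p) if occurs_at(r, s, 1)]
print(hits)  # ['a.bc', 'a...bc']
```

Two of the three realizations occur at position 1, which is the multiplicity of occurrence discussed above.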

Definition 6: motif m and location list ℒ_m. Given a string s on alphabet Σ
and a positive integer k, k ≤ |s|, a string (flexible or rigid) m is a motif with location list
ℒ_m = (l_1, l_2, ..., l_p), if m[1] ≠ '.', m[|m|] ≠ '.', m occurs at each l ∈ ℒ_m, there exists no
l', l' ∉ ℒ_m, such that m occurs at l', and p ≥ k. The requirements for m[1] and m[|m|]
are to ensure that the first and last characters of the motif are solid characters. If don't
care characters are allowed at the ends, the motifs can be made arbitrarily long in size
without conveying any extra information.

Definition 7: realization of a motif m. Given a motif m on an input string s
with a location list ℒ_m, and m' a realization of the string m, then m' is a realization of the
motif m if and only if there exists some k ∈ ℒ_m such that m' occurs at k in s.

Notice that because of the present notation of annotating a dot character 
with an integer interval, instead of a set of integers, not every realization of the flexible 
motif occurs in the input string. In the remaining description, this stricter definition of 
motif realization (Definition 7) will be used unless otherwise specified. 

Definition 8: (m_1 ⪯ m_2). Given two motifs m_1 and m_2, with |m_1| ≤ |m_2|,
m_1 ⪯ m_2 holds if, for every realization m_1' of motif m_1, there exists a realization m_2' of
motif m_2 such that m_1'[j] ⪯ m_2'[j], 1 ≤ j ≤ |m_1|.

For example, let m_1 = AB..E, m_2 = AK..E, and m_3 = ABC.E.G. Then
m_1 ⪯ m_3, and m_2 ⋠ m_3. The following lemma is straightforward to verify.
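
For rigid motifs over single characters, the relation of Definition 8 reduces to a position-by-position comparison. The following Python sketch, offered only as an illustration with hypothetical function names, checks the example motifs AB..E, AK..E, and ABC.E.G; it does not handle flexible motifs or sets of characters.

```python
DOT = "."

def element_precedes(e1, e2):
    """Definition 1 for single characters: e1 is a dot or equals e2."""
    return e1 == DOT or e1 == e2

def motif_precedes(m1, m2):
    """Definition 8 restricted to rigid motifs written as plain strings."""
    if len(m1) > len(m2):
        return False
    return all(element_precedes(a, b) for a, b in zip(m1, m2))

m1, m2, m3 = "AB..E", "AK..E", "ABC.E.G"
print(motif_precedes(m1, m3))  # True
print(motif_precedes(m2, m3))  # False
```

The second comparison fails at the second element, where K and B are distinct solid characters.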

Definition 9: (m_1 = m_2). Given two motifs m_1 and m_2 with |m_1| = |m_2|,
m_1 = m_2 holds if, for every realization m_1' of motif m_1, there exists a realization m_2' of
motif m_2 such that m_1'[j] = m_2'[j], 1 ≤ j ≤ |m_1|.

Lemma 1. If m_1 ⪯ m_2, then ℒ_{m_1} ⊇ ℒ_{m_2}. If m_1 ⪯ m_2 and m_2 ⪯ m_3, then
m_1 ⪯ m_3.

Definition 10: sub-motifs of motif m. Given a motif m, let
m[j_1], m[j_2], ..., m[j_l] be the l solid elements in the motif m. Then the sub-motifs of m are
given as follows: for every j_i, j_k, the sub-motif is obtained by dropping all the elements
before (to the left of) j_i and all elements after (to the right of) j_k in m.

Definition 11: maximal motif. Let p_1, p_2, ..., p_k be the motifs in a sequence
s. Define p_i[j] to be '.' if j > |p_i|. A motif p_i is maximal in composition if and only if there
exists no p_l, l ≠ i, with ℒ_{p_i} = ℒ_{p_l} and p_i ⪯ p_l. A motif p_i, maximal in composition, is also
maximal in length if and only if there exists no motif p_j, j ≠ i, such that p_i is a sub-motif
of p_j and |ℒ_{p_i}| = |ℒ_{p_j}|. A maximal motif is maximal both in composition and in length.

Prerequisites 

It is quite clear that the number of maximal flexible motifs could be 
exponential in the size of the input s. It has been shown that there is a small basis set
of motifs of size O(n) for every input of size n. This was shown in Parida, "Some Results
on Flexible-Pattern Discovery," Proc. of the Eleventh Symp. on Comp. Pattern Matching, 
Lecture Notes in Comp. Science, vol. 1848, pages 33-45 (June 2000), the disclosure of 
which is incorporated herein by reference. The remaining motifs can be computed from 
this set of motifs. The definition and the statement of the theorem discussed in "Some 
Results on Flexible-Pattern Discovery" are repeated here. 

The notions of redundancy and the basis set will now be defined. 
Informally speaking, a motif m can be called redundant if m and its location list ℒ_m can be
deduced from the other motifs without studying the input string s. This notion is 
introduced below and it is described below how the redundant motifs and the location 
lists can be computed from the irredundant motifs. 

Definition 12: redundant and irredundant motif. A maximal motif m, with
location list ℒ_m, is redundant if there exist maximal motifs m_i, 1 ≤ i ≤ p, p ≥ 1, such that
ℒ_m = ℒ_{m_1} ∪ ℒ_{m_2} ∪ ... ∪ ℒ_{m_p} and m ⪯ m_i for all i. A maximal motif that is not redundant is
called an irredundant motif.

Notice that for a rigid motif p > 1 (p in Definition 12), each location list
corresponds to exactly one motif, whereas, for a flexible motif, p could have a value of
one. For example, let s = axfygsbapgrftb. Then m_1 = a.^[1,3]f.^[1,3]b, m_2 = a.^[1,3]g.^[1,3]b, and
m_3 = a.....b with ℒ_{m_1} = ℒ_{m_2} = ℒ_{m_3} = {1,8}. But m_3 is redundant, since m_3 ⪯ m_1, m_2. Also
m_1 ⋠ m_2 and m_2 ⋠ m_1, hence both m_1 and m_2 are irredundant although ℒ_{m_1} = ℒ_{m_2}. This
also illustrates the case where one location list corresponds to two distinct flexible motifs
(motifs m_1 and m_2 are distinct if m_1 = m_2 does not hold).
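
The location lists in this example may be verified, purely for illustration, with the short Python sketch below. The three motifs are written as regular expressions, with one to three don't care characters in each flexible gap and with five don't care characters between a and b in the third motif, which is the spacing that the string s = axfygsbapgrftb requires. The function name is hypothetical.

```python
import re

def location_list(regex, s):
    """1-based positions at which the regular expression matches in s."""
    pattern = re.compile(regex)
    return [i + 1 for i in range(len(s)) if pattern.match(s, i)]

s = "axfygsbapgrftb"
m1 = r"a.{1,3}f.{1,3}b"   # a, 1 to 3 dots, f, 1 to 3 dots, b
m2 = r"a.{1,3}g.{1,3}b"   # a, 1 to 3 dots, g, 1 to 3 dots, b
m3 = r"a.{5}b"            # a, five dots, b

for m in (m1, m2, m3):
    print(m, location_list(m, s))  # each motif is found at [1, 8]
```

All three motifs share the location list {1, 8}, so the location list alone does not distinguish the two irredundant flexible motifs from each other.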

Generating operations. The redundant motifs need to be generated from 
the irredundant ones, if required. The following generating operations are now defined. 
The binary OR operator ⊗ is used in the algorithm in the process of motif detection and
the AND operator ⊕ in the generation of redundant motifs from the basis.

Given an input sequence s, let m, m_1, and m_2 be motifs. The binary AND
operator, m_1 ⊕ m_2, is defined as follows: m = m_1 ⊕ m_2, where m is such that m ⪯ m_1, m_2
and there exists no motif m' with m ≺ m'. For example, if m_1 = A.D.^[2,3]G and
m_2 = AB..^[1,2]FG, then m = m_1 ⊕ m_2 = A...^[2,3]G. The binary OR operator, m_1 ⊗ m_2, is
defined as follows: m = m_1 ⊗ m_2, where m is such that m_1, m_2 ⪯ m and there exists no
motif m' with m' ≺ m. For example, if m_1 = A..D..G and m_2 = AB...FG, then
m = m_1 ⊗ m_2 = AB.D.FG.
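
For two rigid motifs of equal length that are aligned at their first element, the generating operations can be sketched position by position. The Python fragment below is a simplified illustration under that assumption only; it does not treat flexible motifs, sets of characters, or motifs of different lengths, and the function names are hypothetical.

```python
DOT = "."

def rigid_or(m1, m2):
    """Position-wise OR: keep every solid character of either motif.

    Returns None when the motifs carry different solid characters at the
    same position, in which case no common refinement exists.
    """
    out = []
    for a, b in zip(m1, m2):
        if a == DOT:
            out.append(b)
        elif b == DOT or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)

def rigid_and(m1, m2):
    """Position-wise AND: keep only solid characters shared by both motifs."""
    return "".join(a if a == b else DOT for a, b in zip(m1, m2))

print(rigid_or("A..D..G", "AB...FG"))   # AB.D.FG
print(rigid_and("A..D..G", "AB...FG"))  # A.....G
```

The OR result is preceded by both operands in the sense of Definition 8, while the AND result precedes both operands.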

Definition 13: basis. Given an input sequence s, let M be the set of all
maximal motifs on s. A set of maximal motifs B is called a basis of M if and only if the
following hold: (1) for each m ∈ B, m is irredundant with respect to B - {m}; and (2) let
G(X) be the set of all the redundant maximal motifs generated by the set of motifs X, then
M = G(B).

The following theorem has been proved in "Some Results on
Flexible-Pattern Discovery" (which has been previously incorporated by reference), and
only the statement of the theorem is given here.

Theorem 1. Let s be a string with n = |s| and let B be a basis or a set of
irredundant flexible motifs. Then B is unique and |B| = O(n).

A useful corollary to this theorem is presented below. 

Corollary 1. Given an input sequence of length n, let M be a set of motifs,
not necessarily maximal, with the following properties: (1) for each p, q ∈ M, p ≠ q, let p'
be a suffix string of p and p' ⋠ q, unless |ℒ_p| ≠ |ℒ_q|; and (2) there does not exist p ∈ M such
that ℒ_p = ∪_i ℒ_{q_i} and p ⪯ q_i for all i. Then |M| = O(n).

This result is used in the methods of the present invention to bound the 
number of non-maximal motifs at each iteration of the methods. Next, two problems on 
sets, the Set Intersection Problem (SIP) and the Set Union Problem (SUP), are described. 
These are used in the pattern discovery methods discussed below. 

The Set Intersection Problem, SIP(n,m,l). Given n sets S_1, S_2, ..., S_n on m
elements, find all the N distinct sets of the form S_{i_1} ∩ S_{i_2} ∩ ... ∩ S_{i_p} with p ≥ l. Notice that
it is possible that N = O(2^n). An algorithm having a complexity of O(N log n + mn) will
now be described. This algorithm obtains all the intersection sets.

Given n sets S_1, S_2, ..., S_n on m elements, find all the N distinct sets of the
form S_{i_1} ∩ S_{i_2} ∩ ... ∩ S_{i_p}, with p ≥ l. Let the elements be numbered 1...m. Construct a
binary tree T using the subroutine CREATE-NODE shown below. Assume a function
CREATE-SET(S) which creates S, a subset of the S_i, in an appropriate data
structure D (for instance, a tree data structure). A query of the form "if a subset S ∈ D"
(i.e., DOES-EXIST(S)) returns a True or False in time O(log n).

Node CREATE-NODE(S, h, l)
{
(1) New(this-node)
(2) CREATE-SET(S)
(3) Let S' = {S_i ∈ S | h ∈ S_i}
(4) if ((|S'| ≥ l) and not DOES-EXIST(S') and (h ≥ 2))
(5) Left-child = CREATE-NODE(S', h-1, l)
(6) Right-child = CREATE-NODE(S, h-1, l)
(7) return (this-node)
}

For l = 2, there is exactly one node in the tree T. For l > 2, the initial call is
CREATE-NODE({S_1, S_2, ..., S_n}, m, l). Clearly, all the unique intersection sets, which are
N in number, are at the leaf nodes of this tree T. Also, the number of internal nodes
cannot exceed the number of leaf nodes, N. Thus, the total number of nodes of T is O(N).
The cost of query at each node is O(log n) (line (4) of CREATE-NODE). The size of the
input data is O(nm) and each data item is read exactly once in the algorithm (line (3) of
CREATE-NODE). Hence, the algorithm takes O(N log n + nm) time. A tree structure
created by an SIP algorithm is discussed in reference to FIG. 7.
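
To make the statement of the Set Intersection Problem concrete, the following Python sketch enumerates the distinct intersections by brute force. It is not the tree construction of CREATE-NODE and does not achieve the stated complexity; it is given only so that small instances can be checked by hand. The function name, and the choice to discard empty intersections, are assumptions of this sketch.

```python
from itertools import combinations

def distinct_intersections(sets, l):
    """All distinct non-empty intersections of at least l of the given sets.

    Brute force over every choice of p >= l sets, so the running time is
    exponential in len(sets); this is for checking small cases only.
    """
    found = set()
    for p in range(l, len(sets) + 1):
        for combo in combinations(sets, p):
            common = frozenset.intersection(*map(frozenset, combo))
            if common:
                found.add(common)
    return found

S = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
result = distinct_intersections(S, 2)
print(sorted(sorted(x) for x in result))  # [[2, 3], [3], [3, 4]]
```

Here the intersection {3} arises both from a pair of sets and from all three sets, yet it is reported once, which is the sense in which the N sets are distinct.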

The Set Union Problem, SUP(n,m). Given n sets S_1, S_2, ..., S_n on m
elements each, find all the sets S_i such that S_i = S_{i_1} ∪ S_{i_2} ∪ ... ∪ S_{i_p}, i ≠ i_j, 1 ≤ j ≤ p. An
algorithm is now presented that solves this problem in time O(n^2 m).

For each set S_i, one first obtains the sets S_j, j ≠ i, j = 1...n, such
that S_j ⊂ S_i. This can be done in O(nm) time (for each i). Next, check if ∪_j S_j = S_i. Again,
this can be done in O(nm) time. Hence, the total time taken is O(n^2 m).
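
The two steps just described for the Set Union Problem may be sketched in Python as follows. The function name is hypothetical, and the use of proper subsets (so that two identical sets do not account for each other) is an assumption of this sketch rather than a requirement stated above.

```python
def union_of_others(sets):
    """Indices i such that S_i equals the union of other sets it contains.

    Follows the two steps in the text: gather the S_j (j != i) that are
    subsets of S_i, then test whether their union is S_i. Proper subsets
    are used here so that duplicate sets do not explain each other; that
    choice is an assumption of this sketch.
    """
    hits = []
    for i, s_i in enumerate(sets):
        covered = set()
        for j, s_j in enumerate(sets):
            if j != i and s_j.issubset(s_i) and s_j != s_i:
                covered |= s_j
        if covered == s_i:
            hits.append(i)
    return hits

L = [{1, 7}, {15, 22}, {1, 7, 15, 22}, {1, 8}]
print(union_of_others(L))  # [2]
```

Each of the n sets is compared against the other n - 1 sets, and each comparison touches at most m elements, which matches the O(n^2 m) bound given above.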

Pattern Discovery 

The techniques of the present invention can be described as follows. One 
technique begins by computing solid character patterns and then successively grows them 
by concatenating with other patterns until patterns cannot be grown any further.
Additionally, during the growing process, don't care characters and flexible portions may
be added. Unfortunately, the number of patterns at each step grows very rapidly. This
problem is ameliorated by first computing only the basis set. This is done by trimming the
number of growing patterns at each step and using Theorem 1 to bound their number by
O(n). Thus in time O(n^5 log n), the basis can be detected. Note that this is proportional to
only the input. In the next step, the remaining motifs from the basis are computed in time
"proportional" to their number.

Computing the Basis (Irredundant) Motifs

The input parameters are: (1) the string, s, (2) the minimum number of
times a pattern must appear, k, and (3) the flexibility of the dot characters, Δ. Recall that each
element of s is a character or a set of characters from the alphabet Σ, or even a real number.
If the input is a sequence of real numbers, this problem can be mapped onto an instance
of a pattern discovery problem on strings of sets of characters. This is discussed in Parida
et al., "Pattern Discovery on Character Sets and Real-Valued Data: Linear Bound on
Irredundant Motifs and an Efficient Polynomial Time Algorithm," Eleventh ACM-SIAM
Symposium on Discrete Algorithms (SODA), 297-308 (2000), the disclosure of which is
incorporated herein by reference. Thus the treatment discussed herein also extends to
flexible patterns on real number sequences. The flexibility property has the following
interpretation: given a flexibility of Δ, accept dot character annotations of the following
form [α_1, α_2], where (α_2 - α_1) ≤ Δ. For the rest of the description, assume that the
alphabet size is |Σ| = O(1).

The following notation is used. Given a motif m (not necessarily
maximal), F(m) denotes the first element of m and E(m) denotes the last element of m.
Note that F(m) ≠ '.' and E(m) ≠ '.'. The location list ℒ'_m = {(i, j) | m' is the realization of m
that occurs at i and ends at j}. Note that the location list ℒ_m = {i | (i, j) ∈ ℒ'_m}.

Turning now to FIG. 3, a method 110 is shown for determining basis
motifs from an input sequence. Method 110 is used by a system, such as system 100 of
FIG. 1, to determine basis motifs. As discussed above, the input sequence, from which
method 110 creates basis motifs, may be real numbers, a DNA sequence, a protein 
sequence, encrypted files, or any other sequence having an alphabet. 

Method 110 begins in step 305, where solid element motifs are created. A 
solid element motif comprises one or more solid elements, which could be sets or 
characters. Generally, two elements are used, but the method 110 may also start with 
fewer or more elements per solid element motif. Broadly, in step 305, for every σ ∈ Σ,
construct m = σ and ℒ'_m = {(i, i) | s[i] = σ}, with F(m) = E(m) = σ. This step takes O(n) time.
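
A minimal Python sketch of step 305 for an input made of single characters is given below, for illustration only. One pass over the input collects, for every character, the (start, end) pairs at which it occurs; positions are 1-based and the function name is hypothetical.

```python
from collections import defaultdict

def solid_element_motifs(s):
    """Map each character to its list of (start, end) pairs, 1-based.

    For a single-character motif the start and end positions coincide.
    """
    lists = defaultdict(list)
    for i, ch in enumerate(s, start=1):
        lists[ch].append((i, i))
    return dict(lists)

motifs = solid_element_motifs("abcab")
print(motifs["a"])  # [(1, 1), (4, 4)]
print(motifs["c"])  # [(3, 3)]
```

Because each input position is visited once, the sketch runs in time linear in the length of the input.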

Step 310 is optional and is only required while dealing with strings on sets
of characters. In step 310, for sets of characters, common sets are determined. For
example, if m_1 = {b,c,d} and m_2 = {b,c,e}, step 310 checks to see if m = {b,c} exists.
Note that ℒ_m = ℒ_{m_1} ∪ ℒ_{m_2}, while the characters in m are the intersection of the sets of
characters in m_1 and m_2. This problem can be solved using the Set Intersection Problem,
SIP(|Σ|, k, 2). Assuming |Σ| = O(1), this step takes O(n^2) time.

In step 315, don't care characters are added to the solid character motifs to
create rigid motifs. For instance, an input sequence could be abced and a solid character
motif determined in step 305 might be ab. In step 315, the rigid motif that results after
adding a don't care character to the solid character motif ab is a.c. Note that the rigid
motif a.c is a pattern in the input sequence.

Step 315 is stated in more mathematical terms as follows. Let m = m_1 .^d m_2
denote the string obtained by concatenating the element m_1 followed by d '.' characters
followed by the element m_2. For d = 0...n, construct the motif m = m_i .^d m_j and the location
list ℒ'_m = {(x, x + d) | (x, x) ∈ ℒ'_{m_i}, (x + d, x + d) ∈ ℒ'_{m_j}} with F(m) = F(m_i) and E(m) = E(m_j).
This takes O(n^2) time, and the number of motifs at this step is O(n).
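
The construction of step 315 may be illustrated, for single-character inputs, by the Python sketch below. It pairs every two positions of the input, writes the motif made of the two end characters with d don't care characters between them, and records the (start, end) pair. The sketch assumes that d counts the dots strictly between the two solid elements, so that the second element sits at position start + d + 1; that indexing, the function name, and the quorum parameter k defaulting to 2 are assumptions of this example.

```python
from collections import defaultdict

def rigid_pairs(s, k=2):
    """Build motifs x, d dots, y with their (start, end) location lists.

    Only motifs that occur at least k times are kept. Positions are 1-based.
    """
    lists = defaultdict(list)
    n = len(s)
    for start in range(n):
        for end in range(start + 1, n):
            d = end - start - 1          # dots strictly between the two ends
            motif = s[start] + "." * d + s[end]
            lists[motif].append((start + 1, end + 1))
    return {m: locs for m, locs in lists.items() if len(locs) >= k}

pairs = rigid_pairs("abcdaXcd")
print(pairs["a.c"])  # [(1, 3), (5, 7)]
print(pairs["cd"])   # [(3, 4), (7, 8)]
```

The double loop visits every pair of positions once, in keeping with the quadratic time stated for this step.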

In the case of flexible motifs, step 320 is performed. In this step, flexible
motifs are constructed in the following manner. Construct sets of motifs P such that for
all m_i, m_j ∈ P, F(m_i) = F(m_j) and E(m_i) = E(m_j). For each such set P, for l = 0...n-Δ,
m = m_i .^[l, l+Δ] m_j, ℒ'_m = ∪_{d=l}^{l+Δ} ℒ'_{m_i .^d m_j}, and F(m) = F(m_i), E(m) = E(m_j). This takes O(n^2) time
and the number of motifs at this step is O(n).

In step 325, concatenation is performed on the motifs created in previous 
steps. This forms larger motifs. Basically, motifs are concatenated when a "junction" 
element of a first motif is the same as a "junction" element of a second motif. For 
instance, if the last element of a first motif is the same as the first element of a second 
motif, the motifs can be concatenated. This will also hold if each element of a motif is a 
set of characters. It should be noted that the alternative, i.e., if the first element of a first 
motif is the same as the last element of a second motif, also means that the two motifs 
may be concatenated. 

In mathematical terms, step 325 is described as follows. Consider every 
pair of motifs $m_1$ and $m_2$ with $E(m_1) \preceq F(m_2)$ or $F(m_1) \preceq E(m_2)$. Let 
$l = |m_1|$. Define $m$ as follows: 

If $E(m_1) \preceq F(m_2)$ then 

$$m[i] = \begin{cases} m_1[i] & i \le l \\ m_2[i-l+1] & i > l \end{cases}$$ 

If $F(m_1) \preceq E(m_2)$ then 

$$m[i] = \begin{cases} m_1[i] & i \le l \\ m_2[i-l+1] & i > l \end{cases}$$ 

For character motifs, which means that each motif is a string of single characters, the 
formula $E(m_1) \preceq F(m_2)$ is actually $E(m_1) = F(m_2)$. The two cases above are general 
enough to include sets of characters. 
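
A minimal sketch of the concatenation for character motifs is given below. The helper `concatenate` is hypothetical; it assumes motifs are plain strings, that location lists hold 1-based start positions, and that two motifs are joined only when the last element of the first equals the first element of the second. 

```python
def concatenate(m1, locs1, m2, locs2):
    """Join m1 and m2 at a shared junction element, or return None."""
    if m1[-1] != m2[0]:
        return None
    shift = len(m1) - 1  # the junction is the last position of m1
    starts2 = set(locs2)
    merged = m1 + m2[1:]
    # an occurrence of m1 at x extends only if m2 starts at the junction
    locs = [x for x in locs1 if x + shift in starts2]
    return merged, locs

joined = concatenate("a.c", [1, 5], "cd", [3, 7, 12])
```

For the string s = abcdaXcdabbcd of the Background, joining a.c (at positions 1 and 5) with cd (at positions 3, 7, and 12) yields a.cd at positions 1 and 5. 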

In step 330, trimming is performed on motifs created until this point. This 
is the pruning step. There are several kinds of pruning performed: (1) where all suffix 
motifs are removed; (2) where all the "redundant" motifs are removed; and (3) where all 
motifs that occur less than k times are removed. For the first pruning, every location list is 
offset to zero and the offset location lists are checked for identity. If identity is found, the 
motifs with the identical offset location lists are augmented. An example of this is as follows. If the 
motif a.b has a location list of $\mathcal{L}_{a.b}$ = {1, 7, 15}, and also the motif bb has a location list of 
$\mathcal{L}_{bb}$ = {2, 8, 16}, then both of these motifs will have a location list of $\mathcal{L}$ = {0, 6, 14} when 
their respective location lists are offset to zero. Essentially, this means that the motif bb is 
a suffix for the motif a.b. The two motifs are augmented by creating a new motif of abb 
that has a location list of $\mathcal{L}_{abb}$ = {1, 7, 15}. Removing the suffix motifs ensures that motifs 
have maximal distance. 
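
The suffix check in this example can be sketched as follows. The helpers `offset_to_zero` and `overlay` are hypothetical names, and the sketch assumes rigid motifs over single characters in which a solid character takes precedence over a dot when two motifs are laid over each other. 

```python
def offset_to_zero(locs):
    """Shift a location list so that its first entry becomes 0."""
    first = min(locs)
    return [x - first for x in locs]

def overlay(m1, m2, shift):
    """Lay m2 over m1 starting `shift` positions in; solid characters win over dots."""
    length = max(len(m1), shift + len(m2))
    out = []
    for i in range(length):
        c1 = m1[i] if i < len(m1) else "."
        j = i - shift
        c2 = m2[j] if 0 <= j < len(m2) else "."
        if c1 != "." and c2 != "." and c1 != c2:
            raise ValueError("motifs disagree at position %d" % i)
        out.append(c1 if c1 != "." else c2)
    return "".join(out)

locs_a_b = [1, 7, 15]
locs_bb = [2, 8, 16]
same_shape = offset_to_zero(locs_a_b) == offset_to_zero(locs_bb)
shift = min(locs_bb) - min(locs_a_b)
merged = overlay("a.b", "bb", shift)
```

Both lists reduce to {0, 6, 14}, so bb is absorbed into a.b to give abb, which keeps the location list of a.b. 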

The pruning where all the "redundant" motifs are removed will now be 
described. Let L denote all the location lists of the motifs constructed in previous steps. 
Using the Set Union Problem, SUP($|L|$, $n$), remove all the motifs whose location list is 
exactly the union of some other location lists. If $\mathcal{L}_m = \bigcup_i \mathcal{L}_{m_i}$, remove $m$ and update each $m_i$ 
as $m_i = m_i \oplus m$ and, if $|m| > |m_i|$, $E(m_i) = E(m)$. For example, if $m_1$ = a.b, $m_2$ = a..c, and 
$m$ = a...d with $\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2}$, then $m_1$ is updated as $m_1$ = a.b.d and $m_2$ is updated as 
$m_2$ = a..cd. 
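
The following is a brute-force sketch of this pruning and is not the Set Union Problem algorithm itself. The names `absorb` and `prune_redundant` and the location lists in the example are assumptions made for illustration; the motifs are those of the example above. A motif is treated as redundant when the union of all other location lists properly contained in its own list equals that list. 

```python
def absorb(part, whole):
    """Merge the solid characters of `whole` into `part` (both aligned at index 0)."""
    n = max(len(part), len(whole))
    a = part.ljust(n, ".")
    b = whole.ljust(n, ".")
    return "".join(x if x != "." else y for x, y in zip(a, b))

def prune_redundant(motifs):
    """motifs: dict motif -> set of locations. Returns the pruned dict."""
    result = dict(motifs)
    for m, locs in motifs.items():
        # other motifs whose location list is a proper subset of this one
        parts = [p for p, pl in motifs.items() if p != m and pl < locs]
        covered = set().union(*(motifs[p] for p in parts))
        if covered != locs:
            continue
        del result[m]  # m is exactly the union of other lists
        for p in parts:
            if p in result:
                result[absorb(p, m)] = result.pop(p)
    return result

pruned = prune_redundant({
    "a.b": {1, 9},          # hypothetical location lists
    "a..c": {4, 12},
    "a...d": {1, 4, 9, 12},
})
```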

The pruning where all motifs that occur fewer than k times are removed is 
self-explanatory. Step 330 takes $O(n^3)$ time and the number of motifs at this step is $O(n)$. 

In step 335, it is determined if any additional motifs have been created 
from the previous steps. If not (step 335 = NO), the method ends. If one or more 
additional motifs were created (step 335 = YES), then the method continues in step 325. 
The number of iterations is on the order of $\log J$, where $J$ is the length of the longest motif 
in s. Since $J$ is bounded by $n$, method 110 takes $O(n^4 \log n)$ time to detect the basis for rigid 
motifs and $O(n^5 \log n)$ time in the case of flexible motifs. It should be noted that the techniques 
presented herein for determining basis motifs are not output-sensitive, but are efficient. 

Referring now to FIG. 4, an example is shown that applies method 110 to 
a sequence 465 of data. The example of FIG. 4 is used to help explain method 110. 
Sequence 465 is an exemplary sequence of characters, where the sequence has the 
alphabet $\Sigma$ = {a,b,c,d,x,y}. Reference 460 is used to help identify locations in sequence 
465. For the sake of brevity, not all results for each step of method 110 are shown in FIG. 
4. Thus, when "results" are discussed below, the results may not be a complete set of 
results for the step being discussed. FIG. 4 assumes that k = 2 (i.e., motif sets must 
contain at least two motifs) and d = 1 (i.e., there is allowed only one dot character 
between alphabet characters). 

Results 405 are formed during step 305 of FIG. 3, where solid element 
motifs are created. In this example, two-character solid motifs are formed instead of 
single-character motifs. It should be noted that, after the pruning step (step 330 of FIG. 3) 
and because k = 2, the only motifs that will remain are ab, bc, cd, and dc. Results 415 are 
formed during step 315 of FIG. 3, where don't care characters are added. For instance, ab 
is converted to a.c and bc is converted to b.a. It should be noted that, after the pruning 
step, the only motifs that will remain are a.c and c.c. 

Results 420 are formed during step 320 of FIG. 3, where flexible motifs 
are created. Two valid flexible motifs are shown in results 420. The motif $a.^{[0,1]}b$ is 
created from ab (having location list {1,4}) and a.b (having location list {20}). Similarly, 
the motif $a.^{[0,1]}c$ is created from ac (having location list {20}) and a.c (having location 
list {1,4,8,13}). 
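
The merging of rigid motifs into flexible motifs in this example can be sketched as follows. The function `merge_into_flexible` is hypothetical; it assumes each rigid motif consists of one solid character, a run of dots, and one solid character, and it writes the flexible gap as `.[lo,hi]` in plain text. 

```python
import re
from collections import defaultdict

def merge_into_flexible(rigid):
    """rigid: dict motif -> location list, motifs of the form x + dots + y."""
    groups = defaultdict(list)
    for motif, locs in rigid.items():
        match = re.fullmatch(r"(\w)(\.*)(\w)", motif)
        if match:
            first, dots, last = match.groups()
            groups[(first, last)].append((len(dots), locs))
    flexible = {}
    for (first, last), members in groups.items():
        if len(members) < 2:
            continue  # a single gap length is still a rigid motif
        gaps = [g for g, _ in members]
        name = "%s.[%d,%d]%s" % (first, min(gaps), max(gaps), last)
        # the flexible motif occurs wherever any of its rigid forms occurs
        flexible[name] = sorted(set().union(*(locs for _, locs in members)))
    return flexible

flexible = merge_into_flexible({
    "ab": [1, 4], "a.b": [20],
    "ac": [20], "a.c": [1, 4, 8, 13],
})
```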

Results 425 are formed during step 325 of FIG. 3, where motifs from 
previous steps are concatenated. Results 425 are an incomplete list. The motif abc is 
formed by concatenating ab and bc. Additional examples are the following: a.c could be 
concatenated with ca to create a.ca; and a.c could be concatenated with c.b to form a.c.b. 

If method 110 is executed until completion, rigid basis motifs 440 and 
flexible basis motifs 445 will result, after multiple pruning steps 330 of FIG. 3. 

Techniques for determining basis motifs have now been presented. As 
previously discussed, the basis motifs are an irredundant set of motifs and are similar to 
basis vectors. Once the basis motifs have been determined, the redundant motifs may be 
determined. 



Computing Redundant Maximal Patterns 

A redundant maximal motif $m$ is of the form $m_1 \oplus m_2 \oplus \ldots \oplus m_p$ for some 
$p$, with $\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \ldots \cup \mathcal{L}_{m_p}$. It is possible to create redundant maximal motifs 
through a "brute force" method of combining every possible basis motif. However, an 
example is given below to show that a straightforward approach of combining (using the 
operator $\oplus$) compatible motifs does not give the desired time complexity. Two motifs $m_1$ 
and $m_2$ are compatible, without loss of generality, if $m_1[1] \preceq m_2[1]$ and there is $i$ such that 
$m_1[i], m_2[i] \ne$ '.' and $m_1[i] \preceq m_2[i]$, $1 < i \le \min(|m_1|, |m_2|)$. 

The following example illustrates that a simple combination of motifs is 
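wasteful. Let $m_1$ = ab...d, $m_2$ = a...cd, $m_3$ = a.e..d, $m_4$ = a..f.d, with $\mathcal{L}_{m_1}$ = {10,20}, 
$\mathcal{L}_{m_2}$ = {30,40}, $\mathcal{L}_{m_3}$ = {20,40}, $\mathcal{L}_{m_4}$ = {10,30}. Then 
$\mathcal{L}_{m_5} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$, 
$\mathcal{L}_{m_6} = \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$, 
$\mathcal{L}_{m_7} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$, 
$\mathcal{L}_{m_8} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_4}$, and 
$\mathcal{L}_{m_9} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3}$ 
are such that $m_5 = m_6 = m_7 = m_8 = m_9$ = a....d. In other words, 
the motif $m_5$ is constructed at least four more times than required. 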
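
The waste in this example can be checked with a short sketch. The function `oplus` below is an assumed, simplified form of the combining operator for rigid character motifs: it keeps a character only where both motifs agree, writes a dot otherwise, and trims trailing dots. The sketch enumerates the five combinations of three or four basis motifs listed above. 

```python
from itertools import combinations

def oplus(m1, m2):
    """Assumed combine: keep a character where both agree, else '.'; trim trailing dots."""
    merged = "".join(a if a == b else "." for a, b in zip(m1, m2))
    return merged.rstrip(".")

basis = {
    "ab...d": {10, 20},
    "a...cd": {30, 40},
    "a.e..d": {20, 40},
    "a..f.d": {10, 30},
}

results = []
for size in (3, 4):
    for combo in combinations(basis, size):
        motif = combo[0]
        locs = set()
        for m in combo:
            motif = oplus(motif, m)
            locs |= basis[m]
        results.append((motif, sorted(locs)))

distinct = set((motif, tuple(locs)) for motif, locs in results)
```

All five combinations collapse to a single motif with a single location list, so four of the five constructions are redundant work. 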

The following is an output-sensitive algorithm to compute all the 
redundant motifs. Referring now to FIG. 5, a method 120 is shown for determining 
maximal redundant motifs from basis motifs. Method 120 would be performed by a 
system such as system 100. Note that method 120 is optional, as the redundant motifs do 
not have to be determined. Simplistically, method 120 may be explained as follows. The 
set of basis motifs is split into subsets, each subset containing motifs that have the same 
first starting element. The motifs in a subset are aligned and placed into a table. Rows of 
the table are basis motifs and columns of the table are elements in the basis motifs. The 
term "elements" includes don't care characters, flexible don't care portions, and sets. If a 
column of the aligned motifs has the same solid character in more than one row, the 
motifs corresponding to the rows having the same solid character are collected into a set. 

This set will be called a motif set herein, simply to distinguish it from other sets. There 
will generally be multiple motif sets. The Set Intersection Problem (SIP) is used to 
determine a number of unique intersection sets from the motif sets. Each motif set 
corresponds to a maximal redundant motif, as do the unique intersection sets. After 
method 120 is discussed in more detail, additional examples will be given and discussed. 

Method 120 begins in step 505, where a subset of the basis motifs is 
formed. Given $B$, the set of all the irredundant motifs, construct $\mathcal{P}$, a set of subsets of $B$, 
as follows: $P \in \mathcal{P}$ if, for each pair of motifs $m_i, m_j \in P$, without loss of generality, 
$F(m_i) \preceq F(m_j)$ and $m_i \not\preceq m_j$, and $P$ is the largest such set. For each $P \in \mathcal{P}$, 
construct an instance of the Set Intersection Problem (SIP) as follows. 

For each $P \in \mathcal{P}$ do the following. Let $l = \max_{m \in P} |m|$. Construct $\hat{m}[i]$, 
$2 \le i \le l$, as follows: $\hat{m}[i] = \{\sigma \ne \text{'.'} \mid \sigma \preceq p[i],\ p \in P\}$. Note that it is possible 
that $\hat{m}[i] = \{\}$ for some $i$. Now construct an instance of SIP($N'$, $M$, 2) as follows. The $M$ 
elements on which the motif sets are built are a subset of the basis set, and $M = |P|$. The $N'$ motif 
sets are constructed as follows (step 510): $S_j^e = \{m_i \mid m_i[j] = e\}$ for all possible values of $j$ 
and $e$, with $|S_j^e| \ge 2$. Assuming that $l = O(1)$, the number of such motif sets is $N' = O(n)$. Recall 
that $n$ is the length of the input string $s$ whose motifs are being discovered. The unique 
intersection sets are discovered in step 515 through an instance of SIP($N'$, $M$, 2). The 
SIP has been described above, but is further described in reference to FIG. 7. Each $S_j^e$ 
with $|S_j^e| \ge 2$ corresponds to a maximal redundant motif (step 520). Although the same 
location lists may give distinct flexible motifs, this does not cause any problems since the 
solid characters of the motifs in $P$ are used. In step 520, the unique intersection sets 
discovered in step 515 are used to determine additional redundant motifs. 
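
The construction of the motif sets in step 510 can be sketched as follows for rigid character motifs. The function `motif_sets` and the four motifs in the example are hypothetical and chosen only for illustration; columns are numbered from 1, and column 1 is skipped because every motif in the subset shares its first element. 

```python
from collections import defaultdict

def motif_sets(subset, k=2):
    """subset: dict name -> rigid motif string, all aligned on their first element."""
    sets = defaultdict(set)
    for name, motif in subset.items():
        for j, e in enumerate(motif[1:], start=2):  # columns 2..l, 1-based
            if e != ".":
                sets[(j, e)].add(name)
    # sets with fewer than k motifs are not formed
    return {key: members for key, members in sets.items() if len(members) >= k}

subset = {"m1": "ab..d", "m2": "ab.c", "m3": "ad..d", "m4": "ad.c"}
S = motif_sets(subset)
```

Each key (j, e) plays the role of a motif set with solid character e in column j. 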

In step 525, it is determined if there are additional basis motifs that have 
not been part of a previously used subset. Step 525 is similar to step 335 of FIG. 3. If there 
are additional basis motifs (step 525 = YES), then the method continues at step 505. Note 
that motif sets of size less than k are not formed (in this example, k = 2 has been used). If there 
are no additional basis motifs (step 525 = NO), then the method ends. At this point, the 
redundant maximal motifs for the input sequence have been determined. 

The union of the solutions to each SIP instance gives all the maximal 
redundant motifs in time $O(N \log n)$. Recall that $N$ is the number of maximal motifs and 
n is the length of the input sequence. Thus, method 120 is output-sensitive because its 
complexity and hence computation time depend on the output as well as the input. 

Several examples are now given to help further explain method 120. 
Referring now to FIG. 6, a subset 610 is shown. Subset 610 comprises five basis motifs, 
$m_1$, $m_2$, $m_3$, $m_4$, and $m_5$. Each of these basis motifs starts with the same character, a. Each 
motif is aligned to this first character. Motif sets 620 are created in the following manner. 
Wherever there is a column containing characters that are the same, the basis motifs 
corresponding to the equivalent characters are gathered into a set. For instance, to 
determine motif set $S_1$, note that column 1 contains the character b in four locations. 
These four locations correspond to the motifs $m_1$, $m_2$, $m_3$, and $m_4$, which form motif set $S_1$. 
To determine motif set $S_2$, note that column 4 contains the character b in two locations. 
These two locations correspond to the motifs $m_1$ and $m_2$, which form motif set $S_2$. To 
determine motif set $S_3$, note that column 4 contains the character c in three locations. 
These three locations correspond to the motifs $m_3$, $m_4$, and $m_5$, which form motif set $S_3$. 

Now that motif sets have been determined, unique intersection sets are 
determined. A tree 700 is constructed and shown in FIG. 7. Tree 700 comprises nodes 
710, 720, 730, 740, and 750 and "leaves" 711, 721, 731, 741, and 751. Each leaf 
corresponds to an intersection set, and, consequently, the term "intersection set" will be 
used herein. At each node, a group 760 of motifs is used to determine intersection sets. 
This group 760 is the subset 610 of motifs shown in FIG. 6. The group 760 is a set of 
motifs that can be numbered or assigned to columns arbitrarily from 1...m, where m is the 
total number of motifs. At each node of the tree, only one motif from a group is 
considered. In the example of FIG. 7, a box is placed around the motif that is being 
considered from group 760. 

At the root node 710, for instance, one of the motifs in the group 760 is 
selected and considered. At the next node 720, another of the motifs in the group 760 is 
selected and considered. This process continues through each node. At each node, a left 
child group is created from the intersection of all those motif sets that contain the motif 
being considered, provided this intersection has not already been created. 
The right child is the parent set minus the motif already considered. This process ensures 
that the tree can have a depth, measured in nodes, no bigger than m. 

Although the motifs may be selected at random, the present example will 
start with the first motif, $m_1$, end with the last motif, $m_5$, and process the motifs in order. 
At node 710, for the selected basis motif $m_1$, there is one intersection set 711 of $\{S_1, S_2\}$ 
(i.e., only sets $S_1$ and $S_2$ contain basis motif $m_1$). At node 720, for the selected basis motif 
$m_2$, there is an intersection set 721 of $\{S_1, S_2\}$; however, this set is not unique and is 
discarded. At node 730, for the selected motif $m_3$, there is an intersection set 731 of 
$\{S_1, S_3\}$. At node 740, for the motif $m_4$, there is an intersection set 741 of $\{S_1, S_3\}$; 
however, this intersection is not unique and is discarded. The final intersection set 751 is 
$\{S_3\}$, but this intersection set is also not unique and it does not have a size greater than 
one. Thus, the unique intersection sets are $\{S_1, S_2\}$ and $\{S_1, S_3\}$. 
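
The walk through the nodes described above can be mirrored by the following simplified sketch, which visits the motifs in order rather than building tree 700 explicitly. The function `unique_intersections` is a hypothetical name, and the motif sets are those of FIG. 6 as described in the text. For each motif, the motif sets containing it are intersected, and a group of sets is kept only if it has more than one member and has not been seen before. 

```python
def unique_intersections(motif_sets, motifs):
    """Return (names of intersected sets, common motifs) for each new intersection."""
    seen = set()
    found = []
    for m in motifs:
        names = tuple(sorted(n for n, members in motif_sets.items() if m in members))
        if len(names) < 2 or names in seen:
            continue  # a single set, or a group that was already produced
        seen.add(names)
        common = set.intersection(*(motif_sets[n] for n in names))
        found.append((names, sorted(common)))
    return found

S = {
    "S1": {"m1", "m2", "m3", "m4"},
    "S2": {"m1", "m2"},
    "S3": {"m3", "m4", "m5"},
}
result = unique_intersections(S, ["m1", "m2", "m3", "m4", "m5"])
```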

The redundant motifs are then determined from $S_1$, $S_2$, $S_3$, $\{S_1, S_2\}$, and 
$\{S_1, S_3\}$ as follows. The intersection of $S_1$ and $S_2$ is $S_1 \cap S_2 = \{m_1, m_2\}$ = ab..b, with 
location list $\mathcal{L} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2}$. The intersection of $S_1$ and $S_3$ is $S_1 \cap S_3 = \{m_3, m_4\}$ = a...c, 
with location list $\mathcal{L} = \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$. The motif set $S_1$ is 
$S_1 = \{m_1, m_2, m_3, m_4\}$ = ab, with location list $\mathcal{L} = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$. The motif 
set $S_2$ is $S_2 = \{m_1, m_2\}$ = ab..b, but this motif has already been determined. 
The motif set $S_3$ is $S_3 = \{m_3, m_4, m_5\}$ = a....c, with location list 
$\mathcal{L}_{m_3} \cup \mathcal{L}_{m_4} \cup \mathcal{L}_{m_5}$. 

As another illustration, a second example involving rigid motifs is shown 
in FIG. 8. As shown in FIG. 8, let $m_1$ = abed, $m_2$ = abe, $m_3$ = add.d, $m_4$ = ad..e, and 
$m_5$ = ab..d. Here $l = 5$ and $S_2^b = \{m_1, m_2, m_5\}$, $S_2^d = \{m_3, m_4\}$, $S_5^d = \{m_1, m_3, m_5\}$. Each of 
the motif sets corresponds to a maximal redundant motif. For example, $S_2^b$ gives the 
maximal redundant motif $m_1 \oplus m_2 \oplus m_5$ = ab with location list $\mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \mathcal{L}_{m_5}$, $S_2^d$ 
gives the maximal redundant motif $m_3 \oplus m_4$ = ad with location list $\mathcal{L}_{m_3} \cup \mathcal{L}_{m_4}$, and $S_5^d$ gives 
$m_1 \oplus m_3 \oplus m_5$ = a...d with location list $\mathcal{L}_{m_1} \cup \mathcal{L}_{m_3} \cup \mathcal{L}_{m_5}$. The results from SIP give the 
unique intersection set $\{m_1, m_5\}$, and this corresponds to the motif $m = m_1 \oplus m_5$ = ab..d 
with $\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_5}$. 

Consider an example using flexible motifs. This example is shown in FIG. 
9. As shown in FIG. 9, let mi= abP-^ec, m2 = ab. lh2] bc, mi = ab.™be, and 
m 4 = a. [l ' 3 ^cb. Here /= 5. The different motif sets are the following: S 2 b = {mx,m 2 ,mz}, 
Si = {m2,m3,m 4 }, S 5 C = {mi,m2}. Each of the motif sets corresponds to a maximal 
redundant motif. The motif set Sj gives mi®m 2 @m 3 =ab, with location list 
L m2 \jL m \jL mA . The motif set S\ gives m 2 @m 3 0m 4 =a. [1 ' }] S l '^b = aP'^b 5 with 
location list L mi \jL m \jL m ,. The motif set S 5 C gives mi@m 2 = abS l ' 3 \^c = ab.^c 
with location list L mi UL m . The intersection results from SIP gives {m 2 ,mi} with 
m = m 2 e 7w 3 = ab. [l ' 3 ^b and location listi m2 UL m3 . 

Exemplary System 

Turning now to FIG. 10, a block diagram is shown of a system 1000 for 
determining irredundant and redundant motifs in accordance with one embodiment of the 
present invention. It should be understood that system 1000 represents one embodiment 
for implementing system 100 of FIG. 1. System 1000 comprises a computer system 1010 
and a Digital Versatile Disk (DVD) 1050. Computer system 1010 comprises a 
processor 1020, a network interface 1025, a memory 1030, a media interface 1035, and an 
optional display 1040. Network interface 1025 allows computer system 1010 to connect 
to a network, while media interface 1035 allows computer system 1010 to interact with 
media such as a hard drive or DVD 1050. 

As is known in the art, the methods and apparatus discussed herein may be 
distributed as an article of manufacture that itself comprises a computer-readable medium 
having computer-readable code means embodied thereon. The computer-readable 
program code means is operable, in conjunction with a computer system such as 
computer system 1010, to carry out all or some of the steps to perform the methods or 
create the apparatuses discussed herein. The computer-readable medium may be a 
recordable medium (e.g., floppy disks, hard drives, optical disks such as DVD 1050, or 
memory cards) or may be a transmission medium (e.g., a network comprising 
fiber-optics, the world-wide web, cables, or a wireless channel using time-division 
multiple access, code-division multiple access, or other radio-frequency channel). Any 
medium known or developed that can store information suitable for use with a computer 
system may be used. The computer-readable code means is any mechanism for allowing a 
computer to read instructions and data, such as magnetic variations on a magnetic 
medium or height variations on the surface of a compact disk, such as DVD 1050. 

Memory 1030 configures the processor 1020 to implement the methods, 
steps, and functions disclosed herein. The memory 1030 could be distributed or local and 
the processor 1020 could be distributed or singular. The memory 1030 could be 
implemented as an electrical, magnetic or optical memory, or any combination of these or 
other types of storage devices. Moreover, the term "memory" should be construed broadly 
enough to encompass any information able to be read from or written to an address in the 
addressable space accessed by processor 1020. With this definition, information on a 
network, accessible through network interface 1025, is still within memory 1030 because 
the processor 1020 can retrieve the information from the network. It should be noted that 
each distributed processor that makes up processor 1020 generally contains its own 
addressable memory space. It should also be noted that some or all of computer system 
1010 can be incorporated into an application-specific or general-use integrated circuit. 

Optional video display 1040 is any type of video display suitable for 
interacting with a human user of system 1000. Generally, video display 1040 is a 
computer monitor or other similar video display. 

It is to be understood that the embodiments and variations shown and 
described herein are merely illustrative of the principles of this invention and that various 
modifications may be implemented by those skilled in the art without departing from the 
scope and spirit of the invention. 